Quick memory optimisations #776

mauch · 2024-10-23T03:06:32Z

These changes stop the quadrupling of dataframe in memory when using iterrows before upload to the database. There are also a few other minor changes to squeeze out a bit more memory at no cost to processing time.

It runs for me through forced extraction with ~2700 images - and through the measurment pairs with ~1000 images. The pairs dataframe just gets too big after 1000 images.

These changes should be good to hold over a 1.x release (no changes to the databse structure etc.) until further improvements in v2.x release which won't be backwards compatible.

…the while dataframe in memory.

…rformance hit).

ddobie · 2024-10-25T02:52:25Z

CHANGELOG.md

@@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

 #### List of PRs

+- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): Minor memory optimisations


Suggested change

- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): Minor memory optimisations

- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): fix: Minor memory optimisations

Although I realise that I didn't adhere to this formatting in the previously listed PR!!

vast_pipeline/pipeline/forced_extraction.py

vast_pipeline/utils/utils.py

mauch · 2024-11-04T00:10:51Z

I've added some comments where you suggested. Also I've modified part of the revalidation step to save ~5GB of RAM for runs with ~2700 images.

ddobie

Looks good to me. I haven't run it myself, but you have and it passes tests so I'm happy to approve.

* Organise v1.1.1-dev * Fix changelog formatting and update changelog instructions (#772) * Initial changelog formatting issues * Update changelog + instructions * Updated changelog * Updated Code of conduct (#773) * Updated Code of conduct * Updated changelog * Fixed grammar * Fix zenodo DOI * Fixed typo in README * Shorten forced fit measurement names (#734) * Shorten names * Updated changelog * Update clearpiperun to use raw SQL (#775) * timing and memory benchmark * delete raw initial * adding profiler * optimisation handling exceptions * Added logging * Updated delete_run * Fix syntax errors * Disable triggers to see if that fixes speed issues * Remove memory profiling * Reenabled logging * Add end of loop logging, remove tqdm * Remove all tqdm, improve logging slightly * Added timing * Fixed tqdm missing * Fix logging * Added units to logging * specify source id in logging * Toggle triggers * clean up clearpiperun * Other minor updates * Fix variable name * Correctly handle images and skyregions that are associated with multiple runs * PEP8 * Updated changelog * Remove commented code * Remove whitespace - don't know why the linter didn't pick this up * Update vast_pipeline/management/commands/clearpiperun.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/management/commands/clearpiperun.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/management/commands/clearpiperun.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/management/commands/clearpiperun.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Update vast_pipeline/utils/delete_run.py Co-authored-by: Tom Mauch <[email protected]> * Fix logging count * Clean up logging statements --------- Co-authored-by: Shibli Saleheen <[email protected]> Co-authored-by: Tom Mauch <[email protected]> * Quick memory optimisations (#776) * Use itertuples over iterrows since iterrows is an enormous memory hog. * Drop sources_df columns before renaming id column to avoid a copy of the while dataframe in memory. * Decrease default partition size to 15MB * Dont split (large-in-memory) list of DataFrames into dask bags (No performance hit). * Don't write forced parquets in parallel (No perfomance hit for this). * Dont overwrite input DataFrame when writing parquets. * Update CHANGELOG.md * Address review comments. * Copy YAML objects before revalidation so the can be garbage collected. * Appease flake8 * 750 configure workers (#777) * Use itertuples over iterrows since iterrows is an enormous memory hog. * Drop sources_df columns before renaming id column to avoid a copy of the while dataframe in memory. * Decrease default partition size to 15MB * Dont split (large-in-memory) list of DataFrames into dask bags (No performance hit). * Don't write forced parquets in parallel (No perfomance hit for this). * Initial configuration updates for processing options. * Dont overwrite input DataFrame when writing parquets. * Update CHANGELOG.md * Address review comments. * Copy YAML objects before revalidation so the can be garbage collected. * Appease flake8 * Add processing options as optional with defaults. * filter processing config to parallel association. * Add a funtion to determine the number of workers and partitions for Dask. * Use config values for num_workers and max_partition_size throughout pipeline. * Correct working in config template. * Update CHANGELOG.md * Remove unused imports. * Bump strictyaml to 1.6.2 * Use YAML 'null' to create Python None for all cores option. * Make None the default in `calculate_workers_and_partitions` instead of 0 * Updated run config docs * Allow null for num_workers_io and improve validation of processing parameters. * Update num_workers_io default in docs. --------- Co-authored-by: Dougal Dobie <[email protected]> * Prepare v1.2.0 release --------- Co-authored-by: Shibli Saleheen <[email protected]> Co-authored-by: Tom Mauch <[email protected]>

mauch added 5 commits October 22, 2024 14:27

Use itertuples over iterrows since iterrows is an enormous memory hog.

27de763

Drop sources_df columns before renaming id column to avoid a copy of …

b4c6088

…the while dataframe in memory.

Decrease default partition size to 15MB

e180a4e

Dont split (large-in-memory) list of DataFrames into dask bags (No pe…

9e9668d

…rformance hit).

Don't write forced parquets in parallel (No perfomance hit for this).

d7e7b21

mauch requested a review from ddobie October 23, 2024 03:06

mauch added the enhancement New feature or request label Oct 23, 2024

mauch added 2 commits October 24, 2024 09:52

Dont overwrite input DataFrame when writing parquets.

0a08b7b

Update CHANGELOG.md

ff6269c

ddobie reviewed Oct 25, 2024

View reviewed changes

mauch added 4 commits October 29, 2024 13:37

Merge branch 'dev' into optimise_mem

2d82245

Address review comments.

67ea721

Copy YAML objects before revalidation so the can be garbage collected.

c98720b

Appease flake8

5c24a7d

mauch requested a review from ddobie November 4, 2024 00:09

ddobie approved these changes Nov 4, 2024

View reviewed changes

mauch merged commit 19883c2 into dev Nov 4, 2024
5 checks passed

mauch deleted the optimise_mem branch November 4, 2024 01:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quick memory optimisations #776

Quick memory optimisations #776

mauch commented Oct 23, 2024

ddobie Oct 25, 2024

mauch commented Nov 4, 2024

ddobie left a comment

		@@ -24,6 +25,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

		#### List of PRs

		- [#776](https://github.com/askap-vast/vast-pipeline/pull/776): Minor memory optimisations

Quick memory optimisations #776

Quick memory optimisations #776

Conversation

mauch commented Oct 23, 2024

ddobie Oct 25, 2024

Choose a reason for hiding this comment

mauch commented Nov 4, 2024

ddobie left a comment

Choose a reason for hiding this comment